[[tags: manual]]

== Unit srfi-13

SRFI 13 (string library).  Certain procedures contained in this SRFI,
such as {{string-append}}, are identical to R5RS versions and are
omitted from this document.  For full documentation, see the
[[http://srfi.schemers.org/srfi-13/srfi-13.html|original SRFI-13
document]].

On systems that support dynamic loading, the {{srfi-13}} unit can
be made available in the Chicken interpreter ({{csi}}) by entering

<enscript highlight=scheme>
(require-extension srfi-13)
</enscript>

The {{string-hash}} and {{string-hash-ci}} procedures are
not provided in this library unit.  [[Unit srfi-69]] has
compatible definitions.

[[toc:]]

== Notes

=== Strings are code-point sequences

This SRFI treats strings simply as sequences of "code points" or
character encodings. Operations such as comparison or reversal are always
done code point by code point.

Chicken's native strings are simple byte sequences (not Unicode code points).
Comparison or reversal is done byte-wise.  If Unicode semantics are
desired, see the [[utf8]] egg.

=== Case mapping and case-folding

Upper- and lower-casing characters is complex in super-ASCII encodings.
SRFI 13 makes no attempt to deal with these issues; it uses a simple 1-1
locale- and context-independent case-mapping, specifically Unicode's 1-1
case-mappings given in [[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]].

On Chicken, case-mapping is restricted to operate on ASCII characters.

=== String equality & string normalisation

SRFI 13 string equality is simply based upon comparing the encoding
values used for the characters.  On Chicken, strings are compared 
byte-wise.

=== String inequality

SRFI 13 string ordering is strictly based upon a
character-by-character comparison of the values used for representing
the string.

=== Naming conventions

* Procedures whose names end in "-ci" are case-insensitive variants. 
* Procedures whose names end in "!" are side-effecting variants. What values these procedures return is usually not specified. 
* The order of common parameters is consistent across the different procedures. 
* Left/right/both directionality: Procedures that have left/right directional variants use the following convention: 

<table>
<tr><th>Direction</th><th>Suffix</th></tr>
<tr><td>left-to-right</td><td>''none''</td></tr>
<tr><td>right-to-left</td><td>{{-right}}</td></tr>
<tr><td>both</td><td>{{-both}}</td></tr></table>

=== Shared storage

Chicken does not currently have shared-text substrings, nor does its
implementation of SRFI 13 routines ever return one of the
strings that was passed in as a parameter, as is allowed by the
specification.

On the other hand, the functionality is present to allow one to write
efficient code ''without'' shared-text substrings. You can write
efficient code that works by passing around start/end ranges indexing
into a string instead of simply building a shared-text substring.
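
As a rough sketch of that index-passing style, the following counts the words in a string by walking START positions with {{string-skip}} and {{string-index}}, without allocating any substrings. The helper name {{count-words}} is hypothetical and not part of the library; {{char-set:whitespace}} is assumed to come from SRFI 14.

<enscript highlight=scheme>
(require-extension srfi-13 srfi-14)

;; Hypothetical helper: count whitespace-separated words in S
;; using only index ranges (no substrings are built).
(define (count-words s)
  (let loop ((start 0) (n 0))
    ;; Find the first non-whitespace character at or after START.
    (let ((word-start (string-skip s char-set:whitespace start)))
      (if (not word-start)
          n
          ;; The word ends at the next whitespace, or at the end of S.
          (let ((word-end (or (string-index s char-set:whitespace word-start)
                              (string-length s))))
            (loop word-end (+ n 1)))))))

(count-words "  the quick  brown fox ") ; => 4
</enscript>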

== Procedure Specification

In the following procedure specifications:


* An S parameter is a string. 
* A CHAR parameter is a character. 
* START and END parameters are half-open string indices specifying a substring within a string parameter; when optional, they default to 0 and the length of the string, respectively. When specified, it must be the case that 0 <= START <= END <= {{(string-length S)}}, for the corresponding parameter S. They typically restrict a procedure's action to the indicated substring. 
* A PRED parameter is a unary character predicate procedure, returning a true/false value when applied to a character. 
* A CHAR/CHAR-SET/PRED parameter is a value used to select/search for a character in a string. If it is a character, it is used in an equality test; if it is a character set, it is used as a membership test; if it is a procedure, it is applied to the characters as a test predicate. 
* An I parameter is an exact non-negative integer specifying an index into a string. 
* LEN and NCHARS parameters are exact non-negative integers specifying a length of a string or some number of characters. 
* An OBJ parameter may be any value at all. 

Passing values to procedures with these parameters that do not satisfy
these types is an error.

Parameters given in square brackets are optional. Unless otherwise noted in
the text describing the procedure, any prefix of these optional parameters
may be supplied, from zero arguments to the full list. When a procedure
returns multiple values, this is shown by listing the return values in
square brackets, as well. So, for example, the procedure with signature


 halts? F [X INIT-STORE] -> [BOOLEAN INTEGER]

would take one (F), two (F, X) or three (F, X, INIT-STORE) input
parameters, and return two values, a boolean and an integer.

A parameter followed by "{{...}}" means zero-or-more elements. So the
procedure with the signature


 sum-squares X ...  -> NUMBER

takes zero or more arguments (X ...), while the procedure with signature


 spell-check DOC DICT_1 DICT_2 ... -> STRING-LIST


takes two required parameters (DOC and DICT_1) and zero or more optional
parameters (DICT_2 ...).

If a procedure is said to return "unspecified," this means that nothing
at all is said about what the procedure returns. Such a procedure is not
even required to be consistent from call to call. It is simply required to
return a value (or values) that may be passed to a command continuation,
''e.g.'' as the value of an expression appearing as a non-terminal
subform of a {{begin}} expression. Note that in R5RS, this restricts such
a procedure to returning a single value; non-R5RS systems may not even
provide this restriction.


=== Main procedures

==== Predicates

<procedure>(string-null? s) -> boolean</procedure><br>

Is S the empty string?

<procedure>(string-every char/char-set/pred s [start end]) -> value</procedure><br>
<procedure>(string-any char/char-set/pred s [start end]) -> value</procedure><br>

Checks whether the given criterion holds for every / any character in S,
proceeding from left (index START) to right (index END).

If CHAR/CHAR-SET/PRED is a character, it is tested for equality with the
elements of S.

If CHAR/CHAR-SET/PRED is a character set, the elements of S are tested for
membership in the set.

If CHAR/CHAR-SET/PRED is a predicate procedure, it is applied to the
elements of S. The predicate is "witness-generating:"


* If {{string-any}} returns true, the returned true value is the one produced by the application of the predicate. 
* If {{string-every}} returns true, the returned true value is the one produced by the final application of the predicate to S[END-1]. If {{string-every}} is applied to an empty sequence of characters, it simply returns {{#t}}. 

If {{string-every}} or {{string-any}} applies the predicate to the final
element of the selected sequence (''i.e.'', S[END-1]), that final
application is a tail call.

The names of these procedures do not end with a question mark -- this is to
indicate that, in the predicate case, they do not return a simple boolean
({{#t}} or {{#f}}), but a general value.
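
A small illustration of the three kinds of criterion and of the "witness" behaviour, given as a sketch rather than an exhaustive specification. The lambda passed to {{string-any}} is made up for this example.

<enscript highlight=scheme>
(require-extension srfi-13)

;; Character criterion: equality test against each element.
(string-every #\a "aaa")              ; => #t
(string-any   #\x "abc")              ; => #f

;; Predicate criterion.
(string-every char-alphabetic? "abc") ; => #t

;; Witness-generating: the true value produced by the predicate is returned.
(string-any (lambda (c) (and (char-numeric? c) c)) "abc7def") ; => #\7

;; An empty sequence of characters satisfies string-every trivially.
(string-every #\a "")                 ; => #t
</enscript>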


==== Constructors

<procedure>(string-tabulate proc len) -> string</procedure><br>

PROC is an integer->char procedure. Construct a string of size LEN by
applying PROC to each index to produce the corresponding string element.
The order in which PROC is applied to the indices is not specified.
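
For instance, a minimal sketch that builds the first five lower-case letters, assuming an ASCII-compatible character encoding:

<enscript highlight=scheme>
(require-extension srfi-13)

;; Index I is mapped to the I-th letter after #\a.
(string-tabulate
  (lambda (i) (integer->char (+ i (char->integer #\a))))
  5)
;; => "abcde"
</enscript>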


==== List & string conversion

<procedure>(string->list s [start end]) -> char-list</procedure><br>

{{string->list}} is extended from the R5RS definition to take optional
START/END arguments.

<procedure>(reverse-list->string char-list) -> string</procedure><br>

An efficient implementation of {{(compose list->string reverse)}}:


 (reverse-list->string '(#\a #\B #\c)) -> "cBa"

This is a common idiom in the epilog of string-processing loops
that accumulate an answer in a reverse-order list. (See also
{{string-concatenate-reverse}} for the "chunked" variant.)
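
The idiom might look like the following sketch. The helper {{digits-only}} is hypothetical and exists only to show the reverse-order accumulator.

<enscript highlight=scheme>
(require-extension srfi-13)

;; Hypothetical helper: keep only the numeric characters of S.
;; Characters are consed onto ACC, so ACC is in reverse order.
(define (digits-only s)
  (let loop ((i 0) (acc '()))
    (if (= i (string-length s))
        (reverse-list->string acc)   ; undo the reversal in one step
        (let ((c (string-ref s i)))
          (loop (+ i 1)
                (if (char-numeric? c) (cons c acc) acc))))))

(digits-only "a1b2c3") ; => "123"
</enscript>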

<procedure>(string-join string-list [delimiter grammar]) -> string</procedure><br>

This procedure is a simple unparser --- it pastes strings together using
the delimiter string.

The GRAMMAR argument is a symbol that determines how the delimiter is used,
and defaults to {{'infix}}.


* {{'infix}} means an infix or separator grammar: insert the delimiter between list elements. An empty list will produce an empty string -- note, however, that parsing an empty string with an infix or separator grammar is ambiguous. Is it an empty list, or a list of one element, the empty string? 
* {{'strict-infix}} means the same as {{'infix}}, but will raise an error if given an empty list. 
* {{'suffix}} means a suffix or terminator grammar: insert the delimiter after every list element. This grammar has no ambiguities. 
* {{'prefix}} means a prefix grammar: insert the delimiter before every list element. This grammar has no ambiguities. 

The delimiter is the string used to delimit elements; it defaults to a
single space " ".


 (string-join '("foo" "bar" "baz") ":")         => "foo:bar:baz"
 (string-join '("foo" "bar" "baz") ":" 'suffix) => "foo:bar:baz:"
  
 ;; Infix grammar is ambiguous wrt empty list vs. empty string,
 (string-join '()   ":") => ""
 (string-join '("") ":") => ""
  
 ;; but suffix & prefix grammars are not.
 (string-join '()   ":" 'suffix) => ""
 (string-join '("") ":" 'suffix) => ":"



==== Selection

<procedure>(string-copy s [start end]) -> string</procedure><br>
<procedure>(substring/shared s start [end]) -> string</procedure><br>

[R5RS+] {{substring/shared}} returns a string whose contents are the
characters of S beginning with index START (inclusive) and ending with
index END (exclusive). It differs from the R5RS {{substring}} in two ways:


* The END parameter is optional, not required. 
* {{substring/shared}} may return a value that shares memory with S or is {{eq?}} to S. 

{{string-copy}} is extended from its R5RS definition by the addition of its
optional START/END parameters. In contrast to {{substring/shared}}, it is
guaranteed to produce a freshly-allocated string.

Use {{string-copy}} when you want to indicate explicitly in your code that
you wish to allocate new storage; use {{substring/shared}} when you don't
care if you get a fresh copy or share storage with the original string.


 (string-copy "Beta substitution") => "Beta substitution"
 (string-copy "Beta substitution" 1 10) 
     => "eta subst"
 (string-copy "Beta substitution" 5) => "substitution"

<procedure>(string-copy! target tstart s [start end]) -> unspecified</procedure><br>

Copy the sequence of characters from index range [START,END) in string
S to string TARGET, beginning at index TSTART. The characters are copied
left-to-right or right-to-left as needed -- the copy is guaranteed to work,
even if TARGET and S are the same string.

It is an error if the copy operation runs off the end of the target string,
''e.g.''


 (string-copy! (string-copy "Microsoft") 0
               "Regional Microsoft Operating Companies") => ''error''

<procedure>(string-take s nchars) -> string</procedure><br>
<procedure>(string-drop s nchars) -> string</procedure><br>
<procedure>(string-take-right s nchars) -> string</procedure><br>
<procedure>(string-drop-right s nchars) -> string</procedure><br>

{{string-take}} returns the first NCHARS of S; {{string-drop}} returns all
but the first NCHARS of S. {{string-take-right}} returns the last NCHARS
of S; {{string-drop-right}} returns all but the last NCHARS of S. If these
procedures produce the entire string, they may return either S or a copy of
S; in some implementations, proper substrings may share memory with S.


 (string-take "Pete Szilagyi" 6) => "Pete S"
 (string-drop "Pete Szilagyi" 6) => "zilagyi"
  
 (string-take-right "Beta rules" 5) => "rules"
 (string-drop-right "Beta rules" 5) => "Beta "

It is an error to take or drop more characters than are in the string:


 (string-take "foo" 37) => ''error''


<procedure>(string-pad s len [char start end]) -> string</procedure><br>
<procedure>(string-pad-right s len [char start end]) -> string</procedure><br>

Build a string of length LEN consisting of S padded on the left (right) by
as many occurrences of the character CHAR as needed. If S has more than LEN
chars, it is truncated on the left (right) to length LEN. CHAR defaults to
{{#\space}}.

If LEN <= END-START, the returned value is allowed to share storage with S,
or be exactly S (if LEN = END-START).


 (string-pad     "325" 5) => "  325"
 (string-pad   "71325" 5) => "71325"
 (string-pad "8871325" 5) => "71325"
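
A few further cases, sketched here to show the right-hand variant and an explicit pad character:

<enscript highlight=scheme>
(require-extension srfi-13)

(string-pad-right "325" 5)     ; => "325  "
(string-pad-right "8871325" 5) ; => "88713"   (truncated on the right)
(string-pad "42" 6 #\0)        ; => "000042"  (zero-padded on the left)
</enscript>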

<procedure>(string-trim s [char/char-set/pred start end]) -> string</procedure><br>
<procedure>(string-trim-right s [char/char-set/pred start end]) -> string</procedure><br>
<procedure>(string-trim-both s [char/char-set/pred start end]) -> string</procedure><br>

Trim S by skipping over all characters on the left / on the right / on both
sides that satisfy the second parameter CHAR/CHAR-SET/PRED:


* if it is a character CHAR, characters equal to CHAR are trimmed; 
* if it is a char set CS, characters contained in CS are trimmed; 
* if it is a predicate PRED, it is a test predicate that is applied to the characters in S; a character causing it to return true is skipped. 

CHAR/CHAR-SET/PRED defaults to the character set {{char-set:whitespace}}
defined in SRFI 14.

If no trimming occurs, these functions may return either S or a copy of S;
in some implementations, proper substrings may share memory with S.


 (string-trim-both "  The outlook wasn't brilliant,  \n\r")
     => "The outlook wasn't brilliant,"
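
The criterion may also be given explicitly. The following sketch shows a character and a predicate; the lambda is made up for the example.

<enscript highlight=scheme>
(require-extension srfi-13)

(string-trim "000123" #\0)        ; => "123"
(string-trim-right "path///" #\/) ; => "path"
(string-trim-both "--abc--"
                  (lambda (c) (char=? c #\-))) ; => "abc"
</enscript>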


==== Modification

<procedure>(string-fill! s char [start end]) -> unspecified</procedure><br>

[R5RS+] Stores CHAR in every element of S, or in every element of the
range [START,END) if the optional indices are given.

{{string-fill!}} is extended from the R5RS definition to take optional
START/END arguments.
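
A short sketch of the ranged form. The buffer is created with {{string-copy}} so that a string literal is not mutated; the name {{buf}} is arbitrary.

<enscript highlight=scheme>
(require-extension srfi-13)

(define buf (string-copy "abcdefgh"))
(string-fill! buf #\- 2 5)   ; overwrite indices 2, 3 and 4
buf ; => "ab---fgh"
</enscript>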


==== Comparison

<procedure>(string-compare s1 s2 proc< proc= proc> [start1 end1 start2 end2]) -> values</procedure><br>
<procedure>(string-compare-ci s1 s2 proc< proc= proc> [start1 end1 start2 end2]) -> values</procedure><br>

Apply PROC<, PROC=, or PROC> to the mismatch index, depending upon whether
S1 is less than, equal to, or greater than S2. The "mismatch index" is the
largest index I such that for every 0 <= J < I, S1[J] = S2[J] -- that is, I
is the first position that doesn't match.

{{string-compare-ci}} is the case-insensitive variant. Case-insensitive
comparison is done by case-folding characters with the operation


 (char-downcase (char-upcase C))

where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified by
Unicode's UnicodeData.txt table:

[[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]]

The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2. The mismatch index is always an index into S1;
in the case of PROC=, it is always END1; we observe the protocol in this
redundant case for uniformity.


 (string-compare "The cat in the hat" "abcdefgh" 
                 values values values
                 4 6         ; Select "ca" 
                 2 4)        ; & "cd"
     => 5    ; Index of S1's "a"

Comparison is simply done on individual code-points of the string. True
text collation is not handled by this SRFI.

<procedure>(string= s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string<> s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string< s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string> s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string<= s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string>= s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>

These procedures are the lexicographic extensions to strings of the
corresponding orderings on characters. For example, {{string<}} is the
lexicographic ordering on strings induced by the ordering {{char<?}} on
characters. If two strings differ in length but are the same up to the
length of the shorter string, the shorter string is considered to be
lexicographically less than the longer string.

The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2.

Comparison is simply done on individual code-points of the string. True
text collation is not handled by this SRFI.
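
Some illustrative calls, including one that compares substrings selected by the optional indices:

<enscript highlight=scheme>
(require-extension srfi-13)

(string<  "apple" "apples")      ; => #t  (a proper prefix sorts first)
(string<> "abc" "abd")           ; => #t
(string=  "foobar" "barfoo"
          0 3                    ; select "foo" from the first string
          3 6)                   ; and "foo" from the second
;; => #t
</enscript>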

<procedure>(string-ci= s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-ci<> s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-ci< s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-ci> s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-ci<= s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-ci>= s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>

Case-insensitive variants.

Case-insensitive comparison is done by case-folding characters with the
operation


 (char-downcase (char-upcase C))

where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified by
Unicode's UnicodeData.txt table:

[[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]]

<procedure>(string-hash s [bound start end]) -> integer</procedure><br>
<procedure>(string-hash-ci s [bound start end]) -> integer</procedure><br>

Compute a hash value for the string S. BOUND is a non-negative exact
integer specifying the range of the hash function. A positive value
restricts the return value to the range [0,BOUND).

If BOUND is either zero or not given, the implementation may use an
implementation-specific default value, chosen to be as large as is
efficiently practical. For instance, the default range might be chosen for
a given implementation to map all strings into the range of integers that
can be represented with a single machine word.

The optional start/end indices restrict the hash operation to the indicated
substring of S.

{{string-hash-ci}} is the case-insensitive variant. Case-insensitive
comparison is done by case-folding characters with the operation


 (char-downcase (char-upcase C))

where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified by
Unicode's UnicodeData.txt table:

[[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]]

Invariants:


 (<= 0 (string-hash s b) (- b 1)) ; When B > 0.
 (string=    s1 s2)  =>  (= (string-hash s1 b)    (string-hash s2 b))
 (string-ci= s1 s2)  =>  (= (string-hash-ci s1 b) (string-hash-ci s2 b))

A legal but nonetheless discouraged implementation:


 (define (string-hash    s . other-args) 1)
 (define (string-hash-ci s . other-args) 1)

Rationale: allowing the user to specify an explicit bound simplifies user
code by removing the mod operation that typically accompanies every hash
computation, and also may allow the implementation of the hash function to
exploit a reduced range to efficiently compute the hash value. ''E.g.'',
for small bounds, the hash function may be computed in a fashion such
that intermediate values never overflow into bignum integers, allowing
the implementor to provide a fixnum-specific "fast path" for computing the
common cases very rapidly.


==== Prefixes & suffixes

<procedure>(string-prefix-length s1 s2 [start1 end1 start2 end2]) -> integer</procedure><br>
<procedure>(string-suffix-length s1 s2 [start1 end1 start2 end2]) -> integer</procedure><br>
<procedure>(string-prefix-length-ci s1 s2 [start1 end1 start2 end2]) -> integer</procedure><br>
<procedure>(string-suffix-length-ci s1 s2 [start1 end1 start2 end2]) -> integer</procedure><br>

Return the length of the longest common prefix/suffix of the two strings.
For prefixes, this is equivalent to the "mismatch index" for the strings
(modulo the STARTi index offsets).

The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2.

{{string-prefix-length-ci}} and {{string-suffix-length-ci}} are the
case-insensitive variants. Case-insensitive comparison is done by
case-folding characters with the operation


 (char-downcase (char-upcase c))

where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified by
Unicode's UnicodeData.txt table:

[[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]]

Comparison is simply done on individual code-points of the string.

<procedure>(string-prefix? s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-suffix? s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-prefix-ci? s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>
<procedure>(string-suffix-ci? s1 s2 [start1 end1 start2 end2]) -> boolean</procedure><br>

Is S1 a prefix/suffix of S2?

The optional start/end indices restrict the comparison to the indicated
substrings of S1 and S2.

{{string-prefix-ci?}} and {{string-suffix-ci?}} are the case-insensitive
variants. Case-insensitive comparison is done by case-folding characters
with the operation


 (char-downcase (char-upcase c))

where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified by
Unicode's UnicodeData.txt table:

[[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]]

Comparison is simply done on individual code-points of the string.
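
The prefix and suffix procedures might be used as in this sketch (ASCII-only inputs, so the case-insensitive variant behaves as expected on Chicken):

<enscript highlight=scheme>
(require-extension srfi-13)

(string-prefix-length "interstate" "internet") ; => 5  ("inter")
(string-suffix-length "walking" "talking")     ; => 6  ("alking")

(string-prefix? "un" "undo")         ; => #t
(string-suffix? ".scm" "main.scm")   ; => #t
(string-prefix-ci? "UN" "undo")      ; => #t
</enscript>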


==== Searching

<procedure>(string-index s char/char-set/pred [start end]) -> integer or #f</procedure><br>
<procedure>(string-index-right s char/char-set/pred [start end]) -> integer or #f</procedure><br>
<procedure>(string-skip s char/char-set/pred [start end]) -> integer or #f</procedure><br>
<procedure>(string-skip-right s char/char-set/pred [start end]) -> integer or #f</procedure><br>

{{string-index}} ({{string-index-right}}) searches through the string
from the left (right), returning the index of the first occurrence of a
character which


* equals CHAR/CHAR-SET/PRED (if it is a character); 
* is in CHAR/CHAR-SET/PRED (if it is a character set); 
* satisfies the predicate CHAR/CHAR-SET/PRED (if it is a procedure). 

If no match is found, the functions return false.

The START and END parameters specify the beginning and end indices of the
search; the search includes the start index, but not the end index. Be
careful of "fencepost" considerations: when searching right-to-left, the
first index considered is

END-1

whereas when searching left-to-right, the first index considered is

START

That is, the start/end indices describe the same half-open interval
[START,END) in these procedures as they do in all the other SRFI 13
procedures.

The skip functions are similar, but use the complement of the criteria:
they search for the first char that ''doesn't'' satisfy the test. ''E.g.'',
to skip over initial whitespace, say


 (cond ((string-skip s char-set:whitespace) =>
        (lambda (i) ...)) ; s[i] is not whitespace.
       ...)

<procedure>(string-count s char/char-set/pred [start end]) -> integer</procedure><br>

Return a count of the number of characters in S that satisfy the
CHAR/CHAR-SET/PRED argument. If this argument is a procedure, it is applied
to the character as a predicate; if it is a character set, the character
is tested for membership; if it is a character, it is used in an equality
test.
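
For example, with each of the three kinds of criterion ({{char-set:whitespace}} is assumed to come from SRFI 14):

<enscript highlight=scheme>
(require-extension srfi-13 srfi-14)

(string-count "banana" #\a)                      ; => 3
(string-count "a1b22" char-numeric?)             ; => 3
(string-count "hello world" char-set:whitespace) ; => 1
</enscript>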

<procedure>(string-contains s1 s2 [start1 end1 start2 end2]) -> integer or #f</procedure><br>
<procedure>(string-contains-ci s1 s2 [start1 end1 start2 end2]) -> integer or #f</procedure><br>

Does string S1 contain string S2?

Return the index in S1 where S2 occurs as a substring, or false. The
optional start/end indices restrict the operation to the indicated
substrings.

The returned index is in the range [START1,END1). A successful match must
lie entirely in the [START1,END1) range of S1.


 (string-contains "eek -- what a geek." "ee"
                  12 18) ; Searches "a geek"
     => 15

{{string-contains-ci}} is the case-insensitive variant. Case-insensitive
comparison is done by case-folding characters with the operation


 (char-downcase (char-upcase C))

where the two case-mapping operations are assumed to be 1-1, locale- and
context-insensitive, and compatible with the 1-1 case mappings specified by
Unicode's UnicodeData.txt table:

[[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]]

Comparison is simply done on individual code-points of the string.

The names of these procedures do not end with a question mark -- this is
to indicate that they do not return a simple boolean ({{#t}} or {{#f}}).
Rather, they return either false ({{#f}}) or an exact non-negative integer.


==== Alphabetic case mapping

<procedure>(string-titlecase s [start end]) -> string</procedure><br>
<procedure>(string-titlecase! s [start end]) -> unspecified</procedure><br>

For every character C in the selected range of S, if C is preceded by a
cased character, it is downcased; otherwise it is titlecased.

{{string-titlecase}} returns the result string and does not alter its S
parameter. {{string-titlecase!}} is the in-place side-effecting variant.


 (string-titlecase "--capitalize tHIS sentence.") =>
   "--Capitalize This Sentence."
  
 (string-titlecase "see Spot run. see Nix run.") =>
   "See Spot Run. See Nix Run."
  
 (string-titlecase "3com makes routers.") =>
   "3Com Makes Routers."

Note that if a START index is specified, then the character preceding
S[START] has no effect on the titlecase decision for character S[START]:


 (string-titlecase "greasy fried chicken" 2) => "Easy Fried Chicken"

Titlecase and cased information must be compatible with the Unicode
specification.

<procedure>(string-upcase s [start end]) -> string</procedure><br>
<procedure>(string-upcase! s [start end]) -> unspecified</procedure><br>
<procedure>(string-downcase s [start end]) -> string</procedure><br>
<procedure>(string-downcase! s [start end]) -> unspecified</procedure><br>

Raise or lower the case of the alphabetic characters in the string.

{{string-upcase}} and {{string-downcase}} return the result string and do
not alter their S parameter. {{string-upcase!}} and {{string-downcase!}}
are the in-place side-effecting variants.

These procedures use the locale- and context-insensitive 1-1 case mappings
defined by Unicode's UnicodeData.txt table:

[[ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt]]
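
A brief sketch using ASCII input (recall that on Chicken case-mapping is restricted to ASCII). The in-place call works on a copy so that a string literal is not mutated; {{buf}} is an arbitrary name.

<enscript highlight=scheme>
(require-extension srfi-13)

(string-upcase   "Hello, World") ; => "HELLO, WORLD"
(string-downcase "Hello, World") ; => "hello, world"

(define buf (string-copy "hello world"))
(string-upcase! buf 0 5)         ; upcase only indices 0 through 4
buf ; => "HELLO world"
</enscript>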


==== Reverse & append

<procedure>(string-reverse s [start end]) -> string</procedure><br>
<procedure>(string-reverse! s [start end]) -> unspecified</procedure><br>

Reverse the string.

{{string-reverse}} returns the result string and does not alter its S
parameter. {{string-reverse!}} is the in-place side-effecting variant.


 (string-reverse "Able was I ere I saw elba.") 
     => ".able was I ere I saw elbA"
  
 ;;; In-place rotate-left, the Bell Labs way:
 (lambda (s i)
   (let ((i (modulo i (string-length s))))
     (string-reverse! s 0 i)
     (string-reverse! s i)
     (string-reverse! s)))

Unicode note: Reversing a string simply reverses the sequence of
code-points it contains. So a zero-width accent character A coming
''after'' a base character B in string S would come out ''before'' B in the
reversed result.

<procedure>(string-concatenate string-list) -> string</procedure><br>

Append the elements of {{string-list}} together into a single string.
Guaranteed to return a freshly allocated string.

Note that the {{(apply string-append STRING-LIST)}} idiom is not robust for
long lists of strings, as some Scheme implementations limit the number of
arguments that may be passed to an n-ary procedure.
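
For example:

<enscript highlight=scheme>
(require-extension srfi-13)

(string-concatenate '("foo" "bar" "baz")) ; => "foobarbaz"
(string-concatenate '())                  ; => ""
</enscript>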

<procedure>(string-concatenate/shared string-list) -> string</procedure><br>
<procedure>(string-append/shared s_1 ...) -> string</procedure><br>

These two procedures are variants of {{string-concatenate}} and
{{string-append}} that are permitted to return results that share storage
with their parameters. In particular, if {{string-append/shared}} is
applied to just one argument, it may return exactly that argument, whereas
{{string-append}} is required to allocate a fresh string.

<procedure>(string-concatenate-reverse string-list [final-string end]) -> string</procedure><br>
<procedure>(string-concatenate-reverse/shared string-list [final-string end]) -> string</procedure><br>

With no optional arguments, these functions are equivalent to


 (string-concatenate (reverse STRING-LIST))

and


 (string-concatenate/shared (reverse STRING-LIST))

respectively.

If the optional argument FINAL-STRING is specified, it is consed onto
the beginning of STRING-LIST before performing the list-reverse and
string-concatenate operations.

If the optional argument END is given, only the first END characters of
FINAL-STRING are added to the string list, thus producing


 (string-concatenate 
   (reverse (cons (substring/shared FINAL-STRING 0 END)
                  STRING-LIST)))


''E.g.''


 (string-concatenate-reverse '(" must be" "Hello, I") " going.XXXX" 7)
   => "Hello, I must be going."

This procedure is useful in the construction of procedures that accumulate
character data into lists of string buffers, and wish to convert the
accumulated data into a single string when done.

Note that only the order of the list elements is reversed; the characters
within each string keep their original order, as the example above shows.


==== Fold, unfold & map

<procedure>(string-map proc s [start end]) -> string</procedure><br>
<procedure>(string-map! proc s [start end]) -> unspecified</procedure><br>

PROC is a char->char procedure; it is mapped over S.

{{string-map}} returns the result string and does not alter its S
parameter. {{string-map!}} is the in-place side-effecting variant.

Note: The order in which PROC is applied to the elements of S is not
specified.
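
A minimal sketch of both variants. The lambda is made up for the example, and {{buf}} is a fresh copy so that a string literal is not mutated.

<enscript highlight=scheme>
(require-extension srfi-13)

;; Replace every space with an underscore, producing a new string.
(string-map (lambda (c) (if (char=? c #\space) #\_ c))
            "a b c")
;; => "a_b_c"

;; In-place variant.
(define buf (string-copy "shout"))
(string-map! char-upcase buf)
buf ; => "SHOUT"
</enscript>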

<procedure>(string-fold kons knil s [start end]) -> value</procedure><br>
<procedure>(string-fold-right kons knil s [start end]) -> value</procedure><br>

These are the fundamental iterators for strings.

The left-fold operator maps the KONS procedure across the string from left
to right


 (... (KONS S[2] (KONS S[1] (KONS S[0] KNIL))))


In other words, {{string-fold}} obeys the (tail) recursion


 (string-fold KONS KNIL S START END) =
     (string-fold KONS (KONS S[START] KNIL) S START+1 END)


The right-fold operator maps the KONS procedure across the string from
right to left


 (KONS S[0] (... (KONS S[END-3] (KONS S[END-2] (KONS S[END-1] KNIL)))))


obeying the (tail) recursion


 (string-fold-right KONS KNIL S START END) =
     (string-fold-right KONS (KONS S[END-1] KNIL) S START END-1)


Examples:


 ;;; Convert a string to a list of chars.
 (string-fold-right cons '() s)
  
 ;;; Count the number of lower-case characters in a string.
 (string-fold (lambda (c count)
                (if (char-lower-case? c)
                    (+ count 1)
                    count))
              0
              s)
  
 ;;; Double every backslash character in S.
 (let* ((ans-len (string-fold (lambda (c sum)
                                (+ sum (if (char=? c #\\) 2 1)))
                              0 s))
        (ans (make-string ans-len)))
   (string-fold (lambda (c i)
                  (let ((i (if (char=? c #\\)
                               (begin (string-set! ans i #\\) (+ i 1))
                               i)))
                    (string-set! ans i c)
                    (+ i 1)))
                0 s)
   ans)

The right-fold combinator is sometimes called a "catamorphism."

<procedure>(string-unfold p f g seed [base make-final]) -> string</procedure><br>

This is a fundamental constructor for strings.


* G is used to generate a series of "seed" values from the initial seed: SEED, (G SEED), (G^2 SEED), (G^3 SEED), ... 
* P tells us when to stop -- when it returns true when applied to one of these seed values. 
* F maps each seed value to the corresponding character in the result string. These chars are assembled into the string in a left-to-right order. 
* BASE is the optional initial/leftmost portion of the constructed string; it defaults to the empty string "". 
* MAKE-FINAL is applied to the terminal seed value (on which P returns true) to produce the final/rightmost portion of the constructed string. It defaults to {{(lambda (x) "")}}. 

More precisely, the following (simple, inefficient) definitions hold:


 ;;; Iterative
 (define (string-unfold p f g seed base make-final)
   (let lp ((seed seed) (ans base))
     (if (p seed) 
         (string-append ans (make-final seed))
         (lp (g seed) (string-append ans (string (f seed)))))))
                                     
 ;;; Recursive
 (define (string-unfold p f g seed base make-final)
   (string-append base
                  (let recur ((seed seed))
                    (if (p seed) (make-final seed)
                        (string-append (string (f seed))
                                       (recur (g seed)))))))

{{string-unfold}} is a fairly powerful string constructor -- you can use it
to convert a list to a string, read a port into a string, reverse a string,
copy a string, and so forth. Examples:


 (port->string p) = (string-unfold eof-object? values
                                   (lambda (x) (read-char p))
                                   (read-char p))
  
 (list->string lis) = (string-unfold null? car cdr lis)
  
 (string-tabulate f size) = (string-unfold (lambda (i) (= i size)) f add1 0)

To map F over a list LIS, producing a string:


 (string-unfold null? (compose f car) cdr lis)

Interested functional programmers may enjoy noting that
{{string-fold-right}} and {{string-unfold}} are in some sense inverses.
That is, given operations KNULL?, KAR, KDR, KONS, and KNIL satisfying


 (KONS (KAR x) (KDR x)) = x  and (KNULL? KNIL) = #t

then


 (string-fold-right KONS KNIL (string-unfold KNULL? KAR KDR X)) = X


and


 (string-unfold KNULL? KAR KDR (string-fold-right KONS KNIL S)) = S.


The final string constructed does not share storage with either BASE or the
value produced by MAKE-FINAL.

This combinator is sometimes called an "anamorphism."

Note: implementations should take care that runtime stack limits do not
cause overflow when constructing large (''e.g.'', megabyte) strings with
{{string-unfold}}.
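
The optional BASE and MAKE-FINAL arguments are easiest to see in a small example. The following is only a sketch derived from the definitions above; the seeds and strings are arbitrary illustrative values:

<enscript highlight=scheme>
(require-extension srfi-13)

;; BASE is prepended, and MAKE-FINAL supplies the tail once P is true.
(string-unfold null? car cdr '(#\b #\c) "a" (lambda (x) "d"))
;; => "abcd"

;; A numeric seed: produce the digits 1 through 5.
(string-unfold (lambda (i) (> i 5))                     ; P: stop after 5
               (lambda (i) (integer->char (+ i 48)))    ; F: digit character
               (lambda (i) (+ i 1))                     ; G: next seed
               1)
;; => "12345"
</enscript>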

<procedure>(string-unfold-right p f g seed [base make-final]) -> string</procedure><br>

This is a fundamental constructor for strings.


* G is used to generate a series of "seed" values from the initial seed: SEED, (G SEED), (G^2 SEED), (G^3 SEED), ... 
* P tells us when to stop -- when it returns true when applied to one of these seed values. 
* F maps each seed value to the corresponding character in the result string. These chars are assembled into the string in a right-to-left order. 
* BASE is the optional initial/rightmost portion of the constructed string; it defaults to the empty string "". 
* MAKE-FINAL is applied to the terminal seed value (on which P returns true) to produce the final/leftmost portion of the constructed string. It defaults to {{(lambda (x) "")}}. 

More precisely, the following (simple, inefficient) definitions hold:


 ;;; Iterative
 (define (string-unfold-right p f g seed base make-final)
   (let lp ((seed seed) (ans base))
     (if (p seed) 
         (string-append (make-final seed) ans)
         (lp (g seed) (string-append (string (f seed)) ans)))))
  
 ;;; Recursive
 (define (string-unfold-right p f g seed base make-final)
   (string-append (let recur ((seed seed))
                    (if (p seed) (make-final seed)
                        (string-append (recur (g seed))
                                       (string (f seed)))))
                  base))

Interested functional programmers may enjoy noting that {{string-fold}}
and {{string-unfold-right}} are in some sense inverses. That is, given
operations KNULL?, KAR, KDR, KONS, and KNIL satisfying

{{(KONS (KAR X) (KDR X))}} = X and {{(KNULL? KNIL)}} = #t

then


 (string-fold KONS KNIL (string-unfold-right KNULL? KAR KDR X)) = X


and


 (string-unfold-right KNULL? KAR KDR (string-fold KONS KNIL S)) = S.


The final string constructed does not share storage with either BASE or the
value produced by MAKE-FINAL.

Note: implementations should take care that runtime stack limits do not
cause overflow when constructing large (''e.g.'', megabyte) strings with
{{string-unfold-right}}.
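
A brief sketch of the right-to-left assembly, following the definitions above (the input values are arbitrary): the first character generated ends up rightmost, BASE stays at the right edge, and the MAKE-FINAL result lands at the left edge.

<enscript highlight=scheme>
(require-extension srfi-13)

;; Characters are added to the left of what has been built so far.
(string-unfold-right null? car cdr '(#\a #\b #\c))
;; => "cba"

;; BASE is the rightmost portion; MAKE-FINAL provides the leftmost portion.
(string-unfold-right null? car cdr '(#\a #\b) "!" (lambda (x) "<"))
;; => "<ba!"
</enscript>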

<procedure>(string-for-each proc s [start end]) -> unspecified</procedure><br>

Apply PROC to each character in S. {{string-for-each}} is required to
iterate from START to END in increasing order.
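
Because the iteration order is guaranteed, side-effecting accumulation is safe. A minimal sketch, in which the helper name {{upcase-range}} is hypothetical and not part of the library:

<enscript highlight=scheme>
(require-extension srfi-13)

;; Collect the upper-cased characters of S in the range [START, END).
(define (upcase-range s start end)
  (let ((acc '()))
    (string-for-each
     (lambda (c) (set! acc (cons (char-upcase c) acc)))
     s start end)
    ;; Characters were pushed in increasing index order, so reverse them.
    (list->string (reverse acc))))

(upcase-range "abcdef" 1 4)   ; => "BCD"
</enscript>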

<procedure>(string-for-each-index proc s [start end]) -> unspecified</procedure><br>

Apply PROC to each index of S, in order. The optional START/END pairs
restrict the endpoints of the loop. This is simply a method of looping over
a string that is guaranteed to be safe and correct. Example:


 (let* ((len (string-length s))
        (ans (make-string len)))
   (string-for-each-index
       (lambda (i) (string-set! ans (- len i 1) (string-ref s i)))
       s)
   ans)


==== Replicate & rotate

<procedure>(xsubstring s from [to start end]) -> string</procedure><br>

This is the "extended substring" procedure that implements replicated
copying of a substring of some string.

S is a string; START and END are optional arguments that demarcate a
substring of S, defaulting to 0 and the length of S (''i.e.'', the whole
string). Replicate this substring up and down index space, in both the
positive and negative directions. For example, if S = "abcdefg", START=3,
and END=6, then we have the conceptual bidirectionally-infinite string


<table>
<tr><td>...</td><td>d</td><td>e</td><td>f</td><td>d</td><td>e</td><td>f</td><td>d</td><td>e</td><td>f</td><td>d</td><td>e</td><td>f</td><td>d</td><td>e</td><td>f</td><td>d</td><td>e</td><td>f</td><td>d</td><td>...</td></tr>
<tr><td>...</td><td>-9</td><td>-8</td><td>-7</td><td>-6</td><td>-5</td><td>-4</td><td>-3</td><td>-2</td><td>-1</td><td>0</td><td>+1</td><td>+2</td><td>+3</td><td>+4</td><td>+5</td><td>+6</td><td>+7</td><td>+8</td><td>+9</td><td>...</td></tr></table>

{{xsubstring}} returns the substring of this string beginning at index
FROM, and ending at TO (which defaults to FROM+(END-START)).

You can use {{xsubstring}} to perform a variety of tasks:


* To rotate a string left: {{(xsubstring "abcdef" 2)}} => {{"cdefab"}} 
* To rotate a string right: {{(xsubstring "abcdef" -2)}} => {{"efabcd"}} 
* To replicate a string: {{(xsubstring "abc" 0 7)}} => {{"abcabca"}} 

Note that


* The FROM/TO indices give a half-open range -- the characters from index FROM up to, but not including, index TO. 
* The FROM/TO indices are not in terms of the index space for string S. They are in terms of the replicated index space of the substring defined by S, START, and END. 

It is an error if START=END -- although this is allowed by special
dispensation when FROM=TO.
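
The conceptual table above can be checked directly. This sketch uses the same S, START, and END, with an arbitrary FROM/TO range:

<enscript highlight=scheme>
(require-extension srfi-13)

;; The replicated substring is "def" (indices 3..6 of "abcdefg").
;; Indices -2 through 4 of the replicated space read: e f d e f d e
(xsubstring "abcdefg" -2 5 3 6)   ; => "efdefde"
</enscript>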

<procedure>(string-xcopy! target tstart s sfrom [sto start end]) -> unspecified</procedure><br>

Exactly the same as {{xsubstring}}, but the extracted text is written into
the string TARGET starting at index TSTART. This operation is not defined
if {{(eq? TARGET S)}} or these two arguments share storage -- you cannot
copy a string on top of itself.
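
A minimal sketch of filling part of a buffer with a replicated string; the buffer size and fill character are arbitrary choices for illustration:

<enscript highlight=scheme>
(require-extension srfi-13)

(let ((target (make-string 8 #\-)))
  ;; Write indices 0..6 of the replicated "abc" into TARGET at index 1.
  (string-xcopy! target 1 "abc" 0 6)
  target)
;; => "-abcabc-"
</enscript>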


==== Miscellaneous: insertion, parsing

<procedure>(string-replace s1 s2 start1 end1 [start2 end2]) -> string</procedure><br>

Returns


 (string-append (substring/shared S1 0 START1)
                (substring/shared S2 START2 END2)
                (substring/shared S1 END1 (string-length S1)))


That is, the segment of characters in S1 from START1 to END1 is replaced by
the segment of characters in S2 from START2 to END2. If START1=END1, this
simply splices the S2 characters into S1 at the specified index.

Examples:


 (string-replace "The TCL programmer endured daily ridicule."
                 "another miserable perl drone" 4 7 8 22) =>
     "The miserable perl programmer endured daily ridicule."
  
 (string-replace "It's easy to code it up in Scheme." "lots of fun" 5 9) =>
     "It's lots of fun to code it up in Scheme."
  
 (define (string-insert s i t) (string-replace s t i i))
  
 (string-insert "It's easy to code it up in Scheme." 5 "really ") =>
     "It's really easy to code it up in Scheme."

<procedure>(string-tokenize s [token-set start end]) -> list</procedure><br>

Split the string S into a list of substrings, where each substring is a
maximal non-empty contiguous sequence of characters from the character set
TOKEN-SET.


* TOKEN-SET defaults to {{char-set:graphic}} (see SRFI 14 for more on character sets and {{char-set:graphic}}). 
* If START or END indices are provided, they restrict {{string-tokenize}} to operating on the indicated substring of S. 

This function provides a minimal parsing facility for simple applications.
More sophisticated parsers that handle quoting and backslash effects can
easily be constructed using regular-expression systems; be careful not to
use {{string-tokenize}} in contexts where more serious parsing is needed.


 (string-tokenize "Help make programs run, run, RUN!") =>
   ("Help" "make" "programs" "run," "run," "RUN!")
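
Passing a different TOKEN-SET changes what counts as a token. The sketch below assumes the {{srfi-14}} unit is also loaded so that the standard character sets are available:

<enscript highlight=scheme>
(require-extension srfi-13 srfi-14)

;; Tokens are maximal runs of digits; everything else is a separator.
(string-tokenize "a1b22c333" char-set:digit)    ; => ("1" "22" "333")

;; With START = 2, only the substring "b c d" is tokenized.
(string-tokenize "a b c d" char-set:graphic 2)  ; => ("b" "c" "d")
</enscript>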


==== Filtering & deleting

<procedure>(string-filter char/char-set/pred s [start end]) -> string</procedure><br>
<procedure>(string-delete char/char-set/pred s [start end]) -> string</procedure><br>

Filter the string S, retaining only those characters that satisfy / do not
satisfy the CHAR/CHAR-SET/PRED argument. If this argument is a procedure,
it is applied to the character as a predicate; if it is a char-set, the
character is tested for membership; if it is a character, it is used in an
equality test.

If the string is unaltered by the filtering operation, these functions may
return either S or a copy of S.
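
The three kinds of criterion can be sketched as follows, using the argument order given in the signatures above and assuming the {{srfi-14}} unit is loaded for {{char-set:digit}}:

<enscript highlight=scheme>
(require-extension srfi-13 srfi-14)

;; Predicate: keep only alphabetic characters.
(string-filter char-alphabetic? "a1b2c3")   ; => "abc"

;; Character set: keep only digits.
(string-filter char-set:digit "a1b2c3")     ; => "123"

;; Single character: delete every occurrence of #\a.
(string-delete #\a "banana")                ; => "bnn"
</enscript>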


=== Low-level procedures

The following procedures are useful for writing other string-processing
functions. In a Scheme system that has a module or package system, these
procedures should be contained in a module named "string-lib-internals".


==== Start/end optional-argument parsing & checking utilities

<procedure>(string-parse-start+end proc s args) -> [rest start end]</procedure><br>
<procedure>(string-parse-final-start+end proc s args) -> [start end]</procedure><br>

{{string-parse-start+end}} may be used to parse a pair of optional
START/END arguments from an argument list, defaulting them to 0 and the
length of some string S, respectively. Let the length of string S be SLEN.


* If ARGS = (), the function returns {{(values '() 0 SLEN)}} 
* If ARGS = (I), I is checked to ensure it is an exact integer, and that 0 <= I <= SLEN. Returns {{(values (cdr ARGS) I SLEN)}}. 
* If ARGS = {{(I J ...)}}, I and J are checked to ensure they are exact integers, and that 0 <= I <= J <= SLEN. Returns {{(values (cddr ARGS) I J)}}. 

If any of the checks fail, an error condition is raised, and PROC is used
as part of the error condition -- it should be the client procedure whose
argument list {{string-parse-start+end}} is parsing.

{{string-parse-final-start+end}} is exactly the same, except that the args
list passed to it is required to be of length two or less; if it is longer,
an error condition is raised. It may be used when the optional START/END
parameters are final arguments to the procedure.

Note that in all cases, these functions ensure that S is a string (by
necessity, since all cases apply {{string-length}} to S either to default
END or to bounds-check it).
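
The three cases listed above can be observed by collecting the returned values into a list. This is only a sketch; {{parse-demo}} is a hypothetical client procedure, and it assumes these low-level procedures are exported by the unit as documented here:

<enscript highlight=scheme>
(require-extension srfi-13)

;; Return the (REST START END) values as a list for inspection.
(define (parse-demo s . args)
  (call-with-values
      (lambda () (string-parse-start+end parse-demo s args))
    list))

(parse-demo "hello")              ; => (() 0 5)
(parse-demo "hello" 2)            ; => (() 2 5)
(parse-demo "hello" 1 3 'extra)   ; => ((extra) 1 3)
</enscript>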

<procedure>(let-string-start+end (start end [rest]) proc-exp s-exp args-exp body ...) -> value(s)</procedure><br>

[Syntax] Syntactic sugar for an application of {{string-parse-start+end}}
or {{string-parse-final-start+end}}.

If a REST variable is given, the form is equivalent to


 (call-with-values
     (lambda () (string-parse-start+end PROC-EXP S-EXP ARGS-EXP))
   (lambda (REST START END) BODY ...))


If no REST variable is given, the form is equivalent to


 (call-with-values
     (lambda () (string-parse-final-start+end PROC-EXP S-EXP ARGS-EXP))
   (lambda (START END) BODY ...))


<procedure>(check-substring-spec proc s start end) -> unspecified</procedure><br>
<procedure>(substring-spec-ok? s start end) -> boolean</procedure><br>

Check values S, START and END to ensure they specify a valid substring.
This means that S is a string, START and END are exact integers, and 0 <=
START <= END <= {{(string-length S)}}.

If the values are not proper:


* {{check-substring-spec}} raises an error condition. PROC is used as part of the error condition, and should be the procedure whose parameters we are checking. 
* {{substring-spec-ok?}} returns false. 

Otherwise, {{substring-spec-ok?}} returns true, and
{{check-substring-spec}} simply returns (what it returns is not specified).
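
A few sample checks, sketched under the assumption that {{substring-spec-ok?}} is exported by the unit as documented here:

<enscript highlight=scheme>
(require-extension srfi-13)

(substring-spec-ok? "hello" 1 3)   ; => #t
(substring-spec-ok? "hello" 0 5)   ; => #t   (END may equal the length)
(substring-spec-ok? "hello" 3 9)   ; => #f   (END exceeds the length)
(substring-spec-ok? "hello" 4 2)   ; => #f   (START > END)
</enscript>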


==== Knuth-Morris-Pratt searching

The Knuth-Morris-Pratt string-search algorithm is a method of rapidly
scanning a sequence of text for the occurrence of some fixed string. It has
the advantage of never requiring backtracking -- hence, it is useful for
searching not just strings, but also other sequences of text that do not
support backtracking or random-access, such as input ports. These routines
package up the initialisation and searching phases of the algorithm for
general use. They also support searching through sequences of text that
arrive in buffered chunks, in that intermediate search state can be
carried across applications of the search loop from the end of one buffer
application to the next.

A second critical property of KMP search is that it requires the allocation
of auxiliary memory proportional to the length of the pattern, but
''constant'' in the size of the character type. Alternative searching
algorithms frequently require the construction of a table with an entry for
every possible character -- which can be prohibitively expensive in a 16-
or 32-bit character representation.

<procedure>(make-kmp-restart-vector s [c= start end]) -> integer-vector</procedure><br>

Build a Knuth-Morris-Pratt "restart vector," which is useful for quickly
searching character sequences for the occurrence of string S (or the
substring of S demarcated by the optional START/END parameters, if
provided). C= is a character-equality function used to construct the
restart vector. It defaults to {{char=?}}; use {{char-ci=?}} instead for
case-folded string search.

The definition of the restart vector RV for string S is: If we have matched
chars 0..I-1 of S against some search string SS, and S[I] doesn't match
SS[K], then reset I := RV[I], and try again to match SS[K]. If RV[I] = -1,
then punt SS[K] completely, and move on to SS[K+1] and S[0].

In other words, if you have matched the first I chars of S, but the I+1'th
char doesn't match, RV[I] tells you what the next-longest prefix of S is
that you have matched.

The following string-search function shows how a restart vector is used to
search. Note the attractive feature of the search process: it is "on line,"
that is, it never needs to back up and reconsider previously seen data. It
simply consumes characters one-at-a-time until declaring a complete match
or reaching the end of the sequence. Thus, it can be easily adapted to
search other character sequences (such as ports) that do not provide random
access to their contents.


 (define (find-substring pattern source start end)
   (let ((plen (string-length pattern))
         (rv (make-kmp-restart-vector pattern)))
  
     ;; The search loop. SJ & PJ are redundant state.
     (let lp ((si start) (pi 0)
              (sj (- end start))     ; (- end si)  -- how many chars left.
              (pj plen))             ; (- plen pi) -- how many chars left.
  
       (if (= pi plen) (- si plen)                   ; Win.
  
           (and (<= pj sj)                           ; Lose.
  
                (if (char=? (string-ref source si)           ; Test.
                            (string-ref pattern pi))
                    (lp (+ 1 si) (+ 1 pi) (- sj 1) (- pj 1)) ; Advance.
  
                    (let ((pi (vector-ref rv pi)))           ; Retreat.
                      (if (= pi -1)
                          (lp (+ si 1)  0   (- sj 1)  plen)  ; Punt.
                          (lp si        pi  sj        (- plen pi))))))))))

The optional START/END parameters restrict the restart vector to the
indicated substring of the pattern string (the S argument, called PAT in
this description); RV is END - START elements long. If START > 0, then RV
is offset by START elements from PAT. That is, RV[I] describes pattern
element PAT[I + START]. Elements of RV are themselves indices that range
just over [0, END-START), ''not'' [START, END).

Rationale: the actual value of RV is "position independent" -- it does
not depend on where in the PAT string the pattern occurs, but only on the
actual characters comprising the pattern.
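
The length and position-independence properties described above can be sketched as follows; the pattern strings are arbitrary, and the actual contents of the vector are deliberately not shown:

<enscript highlight=scheme>
(require-extension srfi-13)

;; Restart vector for the pattern "abc" embedded at indices 2..5.
(define rv-embedded (make-kmp-restart-vector "xxabcxx" char=? 2 5))

;; RV is END - START elements long.
(vector-length rv-embedded)                          ; => 3

;; The same pattern at offset 0 yields an equal vector.
(equal? rv-embedded (make-kmp-restart-vector "abc")) ; => #t
</enscript>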

<procedure>(kmp-step pat rv c i c= p-start) -> integer</procedure><br>

This function encapsulates the work performed by one step of the KMP string
search; it can be used to scan strings, input ports, or other on-line
character sources for fixed strings.

PAT is the non-empty string specifying the text for which we are searching.
RV is the Knuth-Morris-Pratt restart vector for the pattern, as constructed
by {{make-kmp-restart-vector}}. The pattern begins at PAT[P-START], and
is {{(vector-length RV)}} characters long. C= is the character-equality
function used to construct the restart vector, typically {{char=?}} or
{{char-ci=?}}.

Suppose the pattern is N characters in length: PAT[P-START, P-START +
N). We have already matched I characters: PAT[P-START, P-START + I).
(P-START is typically zero.) C is the next character in the input stream.
{{kmp-step}} returns the new I value -- that is, how much of the pattern
we have matched, ''including'' character C. When I reaches N, the entire
pattern has been matched.

Thus a typical search loop looks like this:


 (let lp ((i 0))
   (or (= i n)                           ; Win -- #t
       (and (not (end-of-stream))        ; Lose -- #f
            (lp (kmp-step pat rv (get-next-character) i char=? 0)))))

Example:


 ;; Read chars from IPORT until we find string PAT or hit EOF.
 (define (port-skip pat iport)
   (let* ((rv (make-kmp-restart-vector pat))
          (patlen (string-length pat)))
     (let lp ((i 0) (nchars 0))
       (if (= i patlen) nchars                    ; Win -- nchars skipped
           (let ((c (read-char iport)))
             (if (eof-object? c) c                ; Fail -- EOF
                 (lp (kmp-step pat rv c i char=? 0) ; Continue
                     (+ nchars 1))))))))

This procedure could be defined as follows:


 (define (kmp-step pat rv c i c= p-start)
   (let lp ((i i))
     (if (c= c (string-ref pat (+ i p-start)))     ; Match =>
         (+ i 1)                                   ;   Done.
         (let ((i (vector-ref rv i)))              ; Back up in PAT.
           (if (= i -1) 0                          ; Can't back up more.
               (lp i))))))                         ; Keep going.

Rationale: this procedure takes no optional arguments because it is
intended as an inner-loop primitive and we do not want any run-time penalty
for optional-argument parsing and defaulting, nor do we wish barriers to
procedure integration/inlining.
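
Since C= must agree between the restart vector and each step, a case-insensitive search passes {{char-ci=?}} in both places. The following is a sketch only; {{match-end-ci}} is a hypothetical helper that scans a list of characters standing in for an on-line source:

<enscript highlight=scheme>
(require-extension srfi-13)

;; Return how many characters of CHARS were consumed when PAT was first
;; matched (ignoring case), or #f if the characters run out first.
(define (match-end-ci pat chars)
  (let ((rv (make-kmp-restart-vector pat char-ci=?))
        (n (string-length pat)))
    (let lp ((i 0) (cs chars) (consumed 0))
      (cond ((= i n) consumed)                       ; Win.
            ((null? cs) #f)                          ; Lose.
            (else (lp (kmp-step pat rv (car cs) i char-ci=? 0)
                      (cdr cs)
                      (+ consumed 1)))))))

(match-end-ci "ANA" (string->list "banana"))   ; => 4
(match-end-ci "xyz" (string->list "banana"))   ; => #f
</enscript>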

<procedure>(string-kmp-partial-search pat rv s i [c= p-start s-start s-end]) -> integer</procedure><br>

Applies {{kmp-step}} across S; optional S-START/S-END bounds parameters
restrict search to a substring of S. The pattern is {{(vector-length RV)}}
characters long; optional P-START index indicates non-zero start of pattern
in PAT.

Suppose PLEN = {{(vector-length RV)}} is the length of the pattern. I is an
integer index into the pattern (that is, 0 <= I < PLEN) indicating how much
of the pattern has already been matched. (This means the pattern must be
non-empty -- PLEN > 0.)


* On success, returns -J, where J is the index in S bounding the ''end'' of the pattern -- ''e.g.'', a value that could be used as the END parameter in a call to {{substring/shared}}. 
* On continue, returns the current search state I' (an index into RV) when the search reached the end of the string. This is a non-negative integer. 

Hence:


* A negative return value indicates success, and says where in the string the match occurred. 
* A non-negative return value provides the I to use for continued search in a following string. 

This utility is designed to allow searching for occurrences of a fixed
string that might extend across multiple buffers of text. This is why,
for example, we do not provide the index of the ''start'' of the match on
success -- it may have occurred in a previous buffer.

To search a character sequence that arrives in "chunks," write a loop of
this form:


 (let lp ((i 0))
   (and (not (end-of-data?))             ; Lose -- return #f.
        (let* ((buf (get-next-chunk))    ; Get or fill up the buffer.
               (i (string-kmp-partial-search pat rv buf i)))
          (if (< i 0) (- i)              ; Win -- return end index.
              (lp i)))))                 ; Keep looking.

Modulo start/end optional-argument parsing, this procedure could be defined
as follows:


 (define (string-kmp-partial-search pat rv s i c= p-start s-start s-end)
   (let ((patlen (vector-length rv)))
     (let lp ((si s-start)       ; An index into S.
              (vi i))            ; An index into RV.
       (cond ((= vi patlen) (- si))      ; Win.
             ((= si s-end) vi)           ; Ran off the end.
             (else (lp (+ si 1)          ; Match s[si] & loop.
                       (kmp-step pat rv (string-ref s si)
                                 vi c= p-start)))))))
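
To make the chunked-search loop concrete, here is a sketch in which the chunks are simply a list of strings. The helper {{search-chunks}} is hypothetical; it returns a pair of the chunk number and the end index of the match within that chunk:

<enscript highlight=scheme>
(require-extension srfi-13)

;; Search a list of string chunks for PAT. Returns (CHUNK-NUMBER . END-INDEX)
;; on success, or #f if the chunks are exhausted without a match.
(define (search-chunks pat chunks)
  (let ((rv (make-kmp-restart-vector pat)))
    (let lp ((i 0) (chunks chunks) (n 0))
      (and (pair? chunks)                                  ; Lose -- #f.
           (let ((r (string-kmp-partial-search pat rv (car chunks) i)))
             (if (< r 0)
                 (cons n (- r))                            ; Win.
                 (lp r (cdr chunks) (+ n 1))))))))         ; Carry state over.

;; "needle" straddles the boundary between the two chunks.
(search-chunks "needle" '("hay ne" "edle hay"))   ; => (1 . 4)
(search-chunks "needle" '("hay" "stack"))         ; => #f
</enscript>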

----

Previous: [[Unit srfi-4]]

Next: [[Unit srfi-14]]
